Dimensionality Reduction
What is Dimensionality Reduction?
Dimensionality reduction reduces the number of input features while preserving as much meaningful information as possible.
There are two main approaches:
Feature Selection
= selects a subset of the original features based on their importance, relevance, or redundancy
Methods
- Forward Selection: add features one by one based on performance improvement
- Backward Elimination: start with all features and remove the least useful one at a time
- Bidirectional Elimination: combines forward selection and backward elimination
- Score Comparison: rank features with statistical or model-based scores (e.g., ANOVA F-test, mutual information); see the sketch after this list
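A minimal sketch of these selection strategies in scikit-learn; the breast-cancer dataset and the logistic-regression estimator are illustrative assumptions, not part of the notes:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, SequentialFeatureSelector, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # 30 original features
est = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Score comparison: keep the 10 features with the highest ANOVA F-scores
X_scored = SelectKBest(score_func=f_classif, k=10).fit_transform(X, y)

# Forward selection: start empty, add features while the cross-validated score improves
forward = SequentialFeatureSelector(est, n_features_to_select=10, direction="forward")
X_fwd = forward.fit_transform(X, y)

# Backward elimination: start with all features, drop the least useful one at a time
backward = SequentialFeatureSelector(est, n_features_to_select=10, direction="backward")
X_bwd = backward.fit_transform(X, y)

print(X.shape, X_scored.shape, X_fwd.shape, X_bwd.shape)  # (569, 30) then (569, 10) x3
```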
Feature Extraction
= transforms original features into a lower-dimensional space using mathematical techniques
Methods
- Principal Component Analysis (PCA)
- linear technique that finds orthogonal components maximizing variance in the data
- Linear Discriminant Analysis (LDA)
- supervised method that projects data to maximize class separability
- Quadratic Discriminant Analysis (QDA)
- similar to LDA but assumes each class has its own covariance matrix, allowing quadratic decision boundaries
- Kernel PCA
- extends PCA using kernel methods to capture non-linear patterns (see the sketch after this list)
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- non-linear technique for visualizing high-dimensional data in 2D/3D, optimized for preserving local structure
- Uniform Manifold Approximation and Projection (UMAP)
- fast and scalable non-linear dimensionality reduction method, preserves both global and local structure
- similar to t-SNE but faster and often better at retaining overall shape
- Autoencoders
- neural network-based models that learn a compressed latent representation through reconstruction
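A minimal sketch contrasting linear PCA with Kernel PCA; the S-curve toy dataset and the RBF kernel parameters are illustrative assumptions:

```python
from sklearn.datasets import make_s_curve
from sklearn.decomposition import PCA, KernelPCA

X, _ = make_s_curve(n_samples=500, random_state=0)  # 3-D points on a non-linear manifold

# PCA: orthogonal linear directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# Kernel PCA: the same variance-maximizing idea, but in an RBF kernel feature space,
# so it can unfold non-linear structure that linear PCA cannot
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0).fit_transform(X)

print(X_pca.shape, X_kpca.shape)  # both (500, 2)
```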
PCA vs. LDA
Comparison
- PCA captures the overall variability in the data; LDA maximizes class separation
- PCA is unsupervised; LDA is supervised (it uses the relation to the dependent variable, i.e. the class labels); see the sketch below
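A minimal side-by-side sketch; the Iris dataset is an illustrative choice, not part of the notes:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA never sees the labels: it just finds the directions of maximum variance
X_pca = PCA(n_components=2).fit_transform(X)

# LDA fits on (X, y): it projects onto axes that best separate the classes
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```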
How LDA works exactly
- it creates new axes that maximize class separation, step by step (2-class example):
- maximize the distance between the two class means $\mu_1$ and $\mu_2$
- minimize the variation (which LDA calls "scatter", $s^2$) within each class
- compute the ratio of the two, which LDA should maximize (numeric sketch after this list):
- $\frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}$
- if there are more than 2 classes, e.g. 3 classes, LDA instead measures the distance $d_i$ of each class mean from the central point of all the data, and the value to be maximized becomes
- $\frac{d_1^2 + d_2^2 + d_3^2}{s_1^2 + s_2^2 + s_3^2}$
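A minimal numeric sketch of the 2-class ratio above; the toy points and candidate axes are made up for illustration:

```python
import numpy as np

class1 = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])
class2 = np.array([[6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])

def lda_ratio(w, a, b):
    """Project both classes onto axis w; return (distance between means)^2 / total scatter."""
    w = w / np.linalg.norm(w)
    pa, pb = a @ w, b @ w                   # 1-D projections onto the candidate axis
    between = (pa.mean() - pb.mean()) ** 2  # squared distance between class means
    within = ((pa - pa.mean()) ** 2).sum() + ((pb - pb.mean()) ** 2).sum()  # scatter
    return between / within

# LDA picks the axis with the largest ratio; compare a few candidates by hand
for w in [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]:
    print(w, round(lda_ratio(w, class1, class2), 2))
```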
t-Distributed Stochastic Neighbor Embedding (t-SNE)
- t-SNE takes high-dimensional data and reduces it to a low-dimensional map (typically 2-D)
- it calculates a similarity measure based on the distances between points instead of trying to maximize variance
- unlike PCA (which is linear), t-SNE can reduce dimensions with non-linear relationships (such as the "Swiss Roll" distribution); see the sketch below
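A minimal sketch on the Swiss Roll; the sample size and perplexity are illustrative defaults, not prescribed by the notes:

```python
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import TSNE

X, color = make_swiss_roll(n_samples=1000, random_state=0)  # 3-D rolled-up sheet

# t-SNE turns pairwise distances into similarities and embeds them in 2-D,
# preserving local neighborhoods rather than maximizing variance like PCA
X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(X_2d.shape)  # (1000, 2)
```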
